How the smt_im data were created from smart_ohio.csv
Today’s R Setup
knitr::opts_chunk$set(comment = NA)

library(janitor)
library(naniar)
library(broom)
library(car)
library(gt)
library(mosaic) ## for df_stats and favstats
library(mice) ## imputation of missing data
library(patchwork)
library(rsample) ## data splitting
library(easystats)
library(tidyverse)

theme_set(theme_lucid())
The smt_im data
894 subjects in Cleveland-Elyria with bmi and no history of diabetes (missing values singly imputed: assume MAR)
All subjects have hx_diabetes = 0 and are located in the MMSA labeled Cleveland-Elyria.
Here, it doesn’t matter much whether we store the 1/0 in exerany as numeric or as a two-level factor in R. For binary variables, sometimes the numeric version will be more useful and sometimes a factor will be more useful.
Our covariate, fruit_day
We are mostly interested in whether accounting for the quantitative covariate fruit_day changes the modeled association of our key predictors with bmi.
Sometimes we center such a covariate (by subtracting its mean).
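As a minimal sketch of centering, assuming a made-up toy tibble in place of smt_im and a hypothetical name, fruit_c, for the centered version of fruit_day:

```r
library(dplyr)

# Toy data standing in for smt_im; these values are made up for illustration
toy <- tibble(fruit_day = c(0.5, 1.0, 1.5, 2.0, 3.0))

# fruit_c is a hypothetical name for the centered covariate
toy <- toy |>
  mutate(fruit_c = fruit_day - mean(fruit_day))

toy
```

After centering, the new variable has mean zero, so in a model without interactions involving the covariate, the intercept describes a subject with an average fruit_day while the covariate's slope is unchanged.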
# A tibble: 10 × 5
# Groups:   exerany [2]
   exerany health     n  mean stdev
     <dbl> <fct>  <int> <dbl> <dbl>
 1       0 E         18  19.2  1.22
 2       0 VG        55  19.5  1.90
 3       0 G         60  18.6  2.22
 4       0 F         32  17.5  2.43
 5       0 P          9  17.4  2.25
 6       1 E         92  19.9  1.64
 7       1 VG       189  19.6  1.74
 8       1 G        150  18.8  1.91
 9       1 F         49  19.5  1.94
10       1 P         16  19.3  2.79
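A summary with these columns (n, mean, stdev within each exerany and health combination) could be built with group_by() and summarise(). This is only a sketch on made-up rows: the names toy, summaries_toy and bmi_tr (for the transformed outcome 100/sqrt(bmi)) are assumptions, not the names used to build the table shown here.

```r
library(dplyr)

# Made-up rows standing in for the training data
toy <- tibble(
  exerany = c(0, 0, 0, 0, 1, 1, 1, 1),
  health  = factor(c("E", "E", "G", "G", "E", "E", "G", "G"),
                   levels = c("E", "VG", "G", "F", "P")),
  bmi     = c(25, 30, 28, 36, 22, 26, 27, 31)
) |>
  mutate(bmi_tr = 100 / sqrt(bmi)) # hypothetical name for the outcome

# One row per observed exerany-by-health combination
summaries_toy <- toy |>
  group_by(exerany, health) |>
  summarise(n = n(),
            mean = mean(bmi_tr),
            stdev = sd(bmi_tr),
            .groups = "drop_last") # result stays grouped by exerany

summaries_toy
```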
Code for Interaction Plot
ggplot(summaries_1, aes(x = health, y = mean, col = factor(exerany))) +
  geom_line(aes(group = factor(exerany)), linewidth = 2) +
  scale_color_viridis_d(option = "C", end = 0.5) +
  labs(title = "Observed Means of 100/sqrt(BMI)",
       subtitle = "by Exercise and Overall Health")
Note the use of factor here since the exerany variable is in fact numeric, although it only takes the values 1 and 0.
Sometimes it’s helpful to treat 1/0 as a factor, and sometimes not.
Where is the evidence of serious non-parallelism (if any) in the plot (see next slide) that results from this code?
Code for Interaction Plot
Fitting a Two-Way ANOVA model for \(100/\sqrt{BMI}\)
Create our transformed outcome
We’ll want to do this in both our training and test samples.
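One way to keep the transformation identical across the two samples is to wrap it in a small function. This is a sketch under assumptions: toy_train and toy_test are made-up stand-ins for the real samples, and add_bmi_tr and bmi_tr are hypothetical names.

```r
library(dplyr)

# Made-up stand-ins for the training and test samples
toy_train <- tibble(bmi = c(25, 36))
toy_test  <- tibble(bmi = c(16, 100))

# Hypothetical helper: adds the transformed outcome 100/sqrt(bmi)
add_bmi_tr <- function(data) {
  data |> mutate(bmi_tr = 100 / sqrt(bmi))
}

toy_train <- add_bmi_tr(toy_train)
toy_test  <- add_bmi_tr(toy_test)
```

To move a prediction back to the original BMI scale, the transformation can be inverted as (100 / bmi_tr)^2.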
influential observations (outliers, leverage and influence)
whether the residuals follow a Normal distribution
collinearity (variance inflation factor)
and a posterior predictive check of our predictions
My slides and check_model()
When building a regular HTML file, I would just use:
check_model(fit1, detrend =FALSE)
with #| fig-height: 9 at the start of the code chunk, so that the plots are taller than the default height (and thus easier to read). For these slides, though, I will split out the plots.
Problem with check_model()
The problem with check_model() (particularly on Macs) now seems to be rectified. Update your packages (for instance, to performance version 0.13.0 or later) if you haven’t yet.
female is based on biological sex (1 = female, 0 = male)
exerany comes from a response to “During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?” (1 = yes, 0 = no, don’t know and refused = missing)
The genhealth variable is based on “Would you say that in general your health is …” using the five specified categories (Excellent -> Poor), numbered for convenience after data collection.
Don’t know / not sure / refused treated as missing.
Might want to run a sanity check here, just to be sure…
Checking health vs. genhealth
smt |> tabyl(genhealth, health) |> adorn_title()

             health
   genhealth   E  VG   G   F  P NA_
 1_Excellent 148   0   0   0  0   0
      3_Good   0   0 274   0  0   0
  2_VeryGood   0 324   0   0  0   0
      4_Fair   0   0   0 112  0   0
      5_Poor   0   0   0   0 35   0
        <NA>   0   0   0   0  0   1
OK. We’ve adjusted the order to something more sensible, retained the missing value, and we have much shorter labels.
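A recoding like this could be done with forcats. The sketch below is one possible approach on a made-up toy tibble, not necessarily the code that created health; fct_recode() shortens the labels and fct_relevel() fixes the order from Excellent to Poor.

```r
library(dplyr)
library(forcats)

# Made-up data: one row per genhealth level, plus a missing value
toy <- tibble(
  genhealth = factor(c("1_Excellent", "3_Good", "2_VeryGood",
                       "4_Fair", "5_Poor", NA))
)

toy <- toy |>
  mutate(
    # new label = "old label"
    health = fct_recode(genhealth,
                        E = "1_Excellent", VG = "2_VeryGood",
                        G = "3_Good", F = "4_Fair", P = "5_Poor"),
    # make the ordering explicit, from Excellent to Poor
    health = fct_relevel(health, "E", "VG", "G", "F", "P")
  )

toy |> count(genhealth, health)
```

Missing values pass through both steps untouched, so the one NA in genhealth remains NA in health.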
Multicategorical race_eth in raw smt
smt |> count(race_eth)

# A tibble: 6 × 2
  race_eth                     n
  <fct>                    <int>
1 White non-Hispanic         646
2 Other race non-Hispanic     22
3 Black non-Hispanic         167
4 Multiracial non-Hispanic    19
5 Hispanic                    27
6 <NA>                        13
“Don’t know”, “Not sure”, and “Refused” were treated as missing.
What is this variable actually about?
What is the most common thing people do here?
What is the question you are asking?
Collapsing race_eth levels might be rational for some questions.
We have lots of data from two categories, but only two.
Systemic racism affects people of color in different ways across these categories, but also within them.
Is combining race and Hispanic/Latinx ethnicity helpful?
It’s hard to see the justice in collecting this information and not using it in as granular a form as possible, though this leaves some small sample sizes. There is no magic number for “too small a sample size.”
Most people identified themselves in one category.
These data are not ordered, and (I’d argue) ordering them isn’t helpful.
Regression models are easier to interpret, though, if the “baseline” category is a common one.
Re-sorting the factor for race_eth
Let’s sort all five levels, from most observations to least…
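One way to do this is forcats::fct_infreq(), which reorders factor levels by how often they occur. The sketch below is an assumption about the approach, run on a toy tibble rebuilt from the counts reported earlier for race_eth rather than on smt itself.

```r
library(dplyr)
library(forcats)

# Toy column rebuilt from the reported counts, including the 13 missing values
toy <- tibble(
  race_eth = factor(rep(
    c("White non-Hispanic", "Other race non-Hispanic", "Black non-Hispanic",
      "Multiracial non-Hispanic", "Hispanic", NA),
    times = c(646, 22, 167, 19, 27, 13)
  ))
)

# Reorder the levels from most observations to least
toy <- toy |>
  mutate(race_eth = fct_infreq(race_eth))

levels(toy$race_eth)
```

With this ordering, the most common category (White non-Hispanic) comes first, so it serves as the baseline level when race_eth enters a regression model.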